In this practical you will answer a research question or solve a problem by building a classification or clustering pipeline.
Proposed Research Questions¶
Classification¶
RQ1: Identification of fake news, hate speech or spam + Interpretability of results¶
Data:
- Fake and Real News Dataset (Kaggle)
- Hate Speech Dataset (GitHub, Papers with Code)
- YouTube Spam Collection (UCI)
Goal:
Evaluate performance of different methods and interpret the results using LIME.
RQ2: Evaluate the importance of metadata. Create a classification system to identify the movie genre using and excluding metadata.¶
Data:
Wikipedia Movie Plots (Kaggle)
Options:
- Create two classification systems, one using only metadata and one using only text, then stack them with scikit-learn's StackingClassifier to create the best model.
- Use the functional API of Keras to create one model that handles both types of inputs (Keras multiple inputs tutorial).
Goal:
Evaluate performance and interpret the results using LIME.
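The stacking option above can be sketched with scikit-learn's StackingClassifier: each base estimator is a pipeline that selects its own columns from the DataFrame. This is a minimal sketch; the column names `plot`, `year`, and `genre` are hypothetical and should be checked against the actual Kaggle file.

```python
# Sketch: stack a text-only and a metadata-only classifier. Column names
# ("plot", "year", "genre") are hypothetical placeholders for the dataset.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# text-only model: TF-IDF over the plot column
text_model = Pipeline([
    ("select", ColumnTransformer([("tfidf", TfidfVectorizer(), "plot")])),
    ("clf", LogisticRegression(max_iter=1000)),
])
# metadata-only model: scaled numeric metadata
meta_model = Pipeline([
    ("select", ColumnTransformer([("num", StandardScaler(), ["year"])])),
    ("clf", LogisticRegression(max_iter=1000)),
])
# the meta-classifier learns how to combine the two base models' outputs
stack = StackingClassifier(
    estimators=[("text", text_model), ("meta", meta_model)],
    final_estimator=LogisticRegression(),
    cv=2,
)
```

The Keras functional-API route instead concatenates a text branch and a metadata branch inside one model, as in the linked tutorial.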
Clustering¶
RQ3: Create a recommendation system for movies based on their plot¶
Data:
Wikipedia Movie Plots (Kaggle)
Output:
What are the closest movies to The Shawshank Redemption, Goodfellas, and Harry Potter and the Sorcerer's Stone?
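One simple way to answer this is to represent each plot as a TF-IDF vector and rank movies by cosine similarity. A minimal sketch, assuming hypothetical column names `Title` and `Plot` (verify them against the Kaggle file):

```python
# Sketch: plot-based movie similarity with TF-IDF + cosine similarity.
# Column names "Title" and "Plot" are assumptions about the dataset.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def closest_movies(df, title, n=5):
    """Return the titles of the n movies whose plots are most similar."""
    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(df["Plot"])
    idx = df.index[df["Title"] == title][0]
    sims = cosine_similarity(X[idx], X).ravel()
    sims[idx] = -1  # exclude the query movie itself
    return df["Title"].iloc[np.argsort(sims)[::-1][:n]].tolist()
```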
RQ4: Cluster headlines using word embeddings¶
Data:
Good News Everyone Corpus (IMS Stuttgart)
(Paper PDF)
Research question:
Do the clusters correlate to emotions or media sources?
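The embedding-then-cluster idea can be sketched as follows: represent each headline by the average of its word vectors, then cluster with KMeans. The toy 2-d `embeddings` dict below is a stand-in for a real pretrained model (e.g. word2vec vectors loaded with gensim), which would give 100-300 dimensions.

```python
# Sketch: average word vectors into a headline vector, then cluster.
# The toy `embeddings` dict stands in for a pretrained embedding model.
import numpy as np
from sklearn.cluster import KMeans

embeddings = {
    "storm": np.array([1.0, 0.0]), "flood": np.array([0.9, 0.1]),
    "wins":  np.array([0.0, 1.0]), "match": np.array([0.1, 0.9]),
}

def headline_vector(headline):
    """Average the vectors of the known words in a headline."""
    vecs = [embeddings[w] for w in headline.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

headlines = ["Storm causes flood", "Flood after storm",
             "Team wins match", "Match wins"]
X = np.vstack([headline_vector(h) for h in headlines])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Whether the resulting clusters line up with emotions or media sources can then be checked against the corpus annotations.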
Additional Resources for Text Analysis Datasets¶
- UCI Machine Learning Repository (Text Datasets)
- Papers with Code Text Datasets
- Kaggle NLP Datasets (with code examples)
Note: Given time restrictions, choosing one of the above recommended datasets is advised.
# Upgrade core libraries
!pip install --upgrade -q numpy pandas scipy matplotlib scikit-learn tensorflow tensorflow-hub tensorflow-datasets vaderSentiment gensim lime scikeras keras-preprocessing
%matplotlib inline
# Data wrangling
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import zipfile
# TensorFlow and Keras
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras import layers, utils
from tensorflow.keras.utils import to_categorical
# Scikit-learn tools
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.datasets import fetch_20newsgroups
from sklearn.preprocessing import LabelEncoder
# Interpretable AI
from lime.lime_text import LimeTextExplainer
# Sentiment Analysis
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
# Scikit-learn + Keras bridge
from scikeras.wrappers import KerasClassifier
RQ1: Identification of Hate Speech¶
Data Sources:¶
Hate Speech Dataset: GitHub Repository (Papers with Code)
Fake vs Real News Dataset: Kaggle
YouTube Spam Messages Dataset: UCI Repository
Task:¶
We provide code for the hate speech dataset. Your goal is to improve the classifier by using a more advanced method.
Dataset Description:¶
The hate speech dataset contains sentences from English Internet forum posts, annotated for hate speech.
The source forum is Stormfront, a large online community of white nationalists.
A total of 10,568 sentences were extracted and labeled as either conveying hate speech or not.
Step 1:¶
- Read the data
- Create a train-test split
zip_path = "Data_application.zip"
with zipfile.ZipFile(zip_path) as z:
    print(z.namelist())
['Data_application/', 'Data_application/rq1_fake_news.csv.gzip', 'Data_application/rq1_hate_speech.csv.gzip', 'Data_application/rq1_youtube.csv.gzip', 'Data_application/rq2_3_wiki_movie_plots.csv.gzip', 'Data_application/rq4_gne-release-v1.0.csv.gzip', 'Data_application/rq5_signalmedia.csv.gzip']
df = pd.read_csv("rq1_hate_speech.csv.gzip", sep="\t", compression="gzip", index_col=0)
df["label"] = df["label"].map({"hate": 1, "noHate": 0})
df = df[["text", "label"]].dropna()
print(df.shape)
df.head()
(10703, 2)
| file_id | text | label |
|---|---|---|
| 12834217_1 | As of March 13th , 2014 , the booklet had been... | 0.0 |
| 12834217_2 | In order to help increase the booklets downloa... | 0.0 |
| 12834217_3 | ( Simply copy and paste the following text int... | 0.0 |
| 12834217_4 | Click below for a FREE download of a colorfull... | 1.0 |
| 12834217_5 | Click on the `` DOWNLOAD ( 7.42 MB ) '' green ... | 0.0 |
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(df["text"].values, df["label"].values, test_size=0.33, random_state=42)
Step 2: Create Pipeline and Hyperparameter Tuning¶
Create a pipeline that vectorizes the text data, transforms it using TF-IDF, and classifies the sentences using LogisticRegression.
# Pipeline
pipe = Pipeline([
('vectorizer', TfidfVectorizer(stop_words='english', #remove stopwords
lowercase=True, #convert to lowercase
token_pattern=r'(?u)\b[A-Za-z][A-Za-z]+\b')), #tokens of at least 2 characters
('clf', LogisticRegression(max_iter=10000, dual=False, solver="saga")) #logistic regression
])
# Parameters for hyperparameter tuning
param_grid = dict(vectorizer__ngram_range=[(1,1), (1,2), (1,3)], # creation of n-grams
vectorizer__min_df=[1, 10, 100], # minimum support for words
clf__C=[0.1, 1, 10, 100], # regularization
clf__penalty=["l2","l1"]) # type of regularization
# Run a grid search using cross-validation to find the best parameters
grid_search = GridSearchCV(pipe, param_grid=param_grid, verbose=True, n_jobs=-1)
# to speed it up we find the hyperparameters using a sample, and fit on the entire dataset later
grid_search.fit(X_train[:1000], y_train[:1000])
# best parameters, score and estimator
print(grid_search.best_params_)
print(grid_search.best_score_)
Fitting 5 folds for each of 72 candidates, totalling 360 fits
{'clf__C': 10, 'clf__penalty': 'l2', 'vectorizer__min_df': 1, 'vectorizer__ngram_range': (1, 2)}
0.893
/opt/conda/lib/python3.11/site-packages/sklearn/model_selection/_validation.py:516: FitFailedWarning:
120 fits failed out of a total of 360.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.
Below are more details about the failures:
--------------------------------------------------------------------------------
120 fits failed with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 859, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/opt/conda/lib/python3.11/site-packages/sklearn/base.py", line 1363, in wrapper
return fit_method(estimator, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/sklearn/pipeline.py", line 653, in fit
Xt = self._fit(X, y, routed_params, raw_params=params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/sklearn/pipeline.py", line 587, in _fit
X, fitted_transformer = fit_transform_one_cached(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/joblib/memory.py", line 326, in __call__
return self.func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/sklearn/pipeline.py", line 1539, in _fit_transform_one
res = transformer.fit_transform(X, y, **params.get("fit_transform", {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/sklearn/feature_extraction/text.py", line 2104, in fit_transform
X = super().fit_transform(raw_documents)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/sklearn/base.py", line 1363, in wrapper
return fit_method(estimator, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/sklearn/feature_extraction/text.py", line 1389, in fit_transform
X = self._limit_features(
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/sklearn/feature_extraction/text.py", line 1241, in _limit_features
raise ValueError(
ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df.
warnings.warn(some_fits_failed_message, FitFailedWarning)
/opt/conda/lib/python3.11/site-packages/sklearn/model_selection/_search.py:1135: UserWarning: One or more of the test scores are non-finite: [0.892 0.892 0.892 0.892 0.892 0.892 nan nan nan 0.892 0.892 0.892
0.892 0.892 0.892 nan nan nan 0.892 0.892 0.892 0.891 0.891 0.891
nan nan nan 0.889 0.892 0.892 0.889 0.889 0.889 nan nan nan
0.891 0.893 0.891 0.884 0.884 0.884 nan nan nan 0.882 0.866 0.851
0.883 0.883 0.883 nan nan nan 0.885 0.891 0.892 0.882 0.882 0.882
nan nan nan 0.88 0.873 0.861 0.88 0.88 0.88 nan nan nan]
warnings.warn(
# print results
results = pd.DataFrame(grid_search.cv_results_)
results.sort_values(by="mean_test_score", ascending=False).head(10)
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_clf__C | param_clf__penalty | param_vectorizer__min_df | param_vectorizer__ngram_range | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 37 | 0.089452 | 0.002624 | 0.003111 | 0.000323 | 10.0 | l2 | 1 | (1, 2) | {'clf__C': 10, 'clf__penalty': 'l2', 'vectoriz... | 0.895 | 0.900 | 0.895 | 0.885 | 0.89 | 0.893 | 0.005099 | 1 |
| 0 | 0.038958 | 0.002399 | 0.002621 | 0.000166 | 0.1 | l2 | 1 | (1, 1) | {'clf__C': 0.1, 'clf__penalty': 'l2', 'vectori... | 0.895 | 0.895 | 0.890 | 0.890 | 0.89 | 0.892 | 0.002449 | 2 |
| 2 | 0.114242 | 0.029817 | 0.003639 | 0.000287 | 0.1 | l2 | 1 | (1, 3) | {'clf__C': 0.1, 'clf__penalty': 'l2', 'vectori... | 0.895 | 0.895 | 0.890 | 0.890 | 0.89 | 0.892 | 0.002449 | 2 |
| 1 | 0.068302 | 0.004164 | 0.003067 | 0.000079 | 0.1 | l2 | 1 | (1, 2) | {'clf__C': 0.1, 'clf__penalty': 'l2', 'vectori... | 0.895 | 0.895 | 0.890 | 0.890 | 0.89 | 0.892 | 0.002449 | 2 |
| 4 | 0.021300 | 0.000588 | 0.002939 | 0.000065 | 0.1 | l2 | 10 | (1, 2) | {'clf__C': 0.1, 'clf__penalty': 'l2', 'vectori... | 0.895 | 0.895 | 0.890 | 0.890 | 0.89 | 0.892 | 0.002449 | 2 |
| 5 | 0.024201 | 0.000310 | 0.003755 | 0.000927 | 0.1 | l2 | 10 | (1, 3) | {'clf__C': 0.1, 'clf__penalty': 'l2', 'vectori... | 0.895 | 0.895 | 0.890 | 0.890 | 0.89 | 0.892 | 0.002449 | 2 |
| 9 | 0.011019 | 0.000311 | 0.002468 | 0.000140 | 0.1 | l1 | 1 | (1, 1) | {'clf__C': 0.1, 'clf__penalty': 'l1', 'vectori... | 0.895 | 0.895 | 0.890 | 0.890 | 0.89 | 0.892 | 0.002449 | 2 |
| 3 | 0.017754 | 0.000729 | 0.002298 | 0.000059 | 0.1 | l2 | 10 | (1, 1) | {'clf__C': 0.1, 'clf__penalty': 'l2', 'vectori... | 0.895 | 0.895 | 0.890 | 0.890 | 0.89 | 0.892 | 0.002449 | 2 |
| 11 | 0.025432 | 0.000888 | 0.003723 | 0.000582 | 0.1 | l1 | 1 | (1, 3) | {'clf__C': 0.1, 'clf__penalty': 'l1', 'vectori... | 0.895 | 0.895 | 0.890 | 0.890 | 0.89 | 0.892 | 0.002449 | 2 |
| 12 | 0.008889 | 0.000113 | 0.002319 | 0.000030 | 0.1 | l1 | 10 | (1, 1) | {'clf__C': 0.1, 'clf__penalty': 'l1', 'vectori... | 0.895 | 0.895 | 0.890 | 0.890 | 0.89 | 0.892 | 0.002449 | 2 |
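With a larger grid, RandomizedSearchCV (already imported above) is a common alternative to the exhaustive search: it samples a fixed number of parameter settings, so the cost stays bounded as the grid grows. A hedged sketch on a fresh pipeline:

```python
# Sketch: randomized search samples n_iter parameter settings instead of
# trying every combination; distributions can replace fixed value lists.
from scipy.stats import loguniform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

pipe_rs = Pipeline([
    ("vectorizer", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=10000, solver="saga")),
])
param_dist = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "clf__C": loguniform(1e-2, 1e2),  # sample C on a log scale
    "clf__penalty": ["l1", "l2"],
}
rand_search = RandomizedSearchCV(pipe_rs, param_dist, n_iter=10,
                                 random_state=42, n_jobs=-1)
```

Call `rand_search.fit(X_train, y_train)` just like GridSearchCV; `best_params_` and `best_score_` work the same way.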
# Use the best parameters in the pipe and refit on the full training set
pipe = pipe.set_params(**grid_search.best_params_)
clf_best = pipe.fit(X_train, y_train)
# print vocabulary size
print(len(clf_best["vectorizer"].get_feature_names_out()))
#vocabulary
#clf_best["vectorizer"].vocabulary_
# accuracy on the training set
print(clf_best.score(X_train, y_train))
# accuracy on the test set
print(clf_best.score(X_test, y_test))
53376
0.9993027471761261
0.8958097395243488
# Add predictions to the dataframe
df["predicted"] = clf_best.predict(df["text"])
df["predicted_prob_hate"] = clf_best.predict_proba(df["text"])[:,1]
df
| file_id | text | label | predicted | predicted_prob_hate |
|---|---|---|---|---|
| 12834217_1 | As of March 13th , 2014 , the booklet had been... | 0.0 | 0.0 | 0.017530 |
| 12834217_2 | In order to help increase the booklets downloa... | 0.0 | 0.0 | 0.018839 |
| 12834217_3 | ( Simply copy and paste the following text int... | 0.0 | 0.0 | 0.012751 |
| 12834217_4 | Click below for a FREE download of a colorfull... | 1.0 | 1.0 | 0.692506 |
| 12834217_5 | Click on the `` DOWNLOAD ( 7.42 MB ) '' green ... | 0.0 | 0.0 | 0.016758 |
| ... | ... | ... | ... | ... |
| 33676864_5 | Billy - `` That guy would n't leave me alone ,... | 0.0 | 0.0 | 0.057358 |
| 33677019_1 | Wish we at least had a Marine Le Pen to vote f... | 0.0 | 0.0 | 0.048570 |
| 33677019_2 | Its like the choices are white genocide candid... | 0.0 | 0.0 | 0.040078 |
| 33677053_1 | Why White people used to say that sex was a si... | 1.0 | 0.0 | 0.112932 |
| 33677053_2 | Now I get it ! | 0.0 | 0.0 | 0.042996 |
10703 rows × 4 columns
# Extract the coefficients from the model
coefs = pd.DataFrame([clf_best["vectorizer"].get_feature_names_out(),
clf_best["clf"].coef_[0]]).T
coefs.columns = ["gram","coef"]
# top words influencing hate
display(coefs.sort_values(by="coef", ascending=False).head(10))
# top words influencing non-hate
display(coefs.sort_values(by="coef", ascending=True).head(10))
| | gram | coef |
|---|---|---|
| 31434 | negroes | 8.12501 |
| 4218 | black | 7.266724 |
| 26064 | liberals | 6.231855 |
| 18671 | groid | 6.173916 |
| 15506 | filth | 6.04755 |
| 40573 | scum | 6.041375 |
| 1723 | ape | 5.951154 |
| 1732 | apes | 4.900051 |
| 686 | africa | 4.816671 |
| 30570 | mud | 4.781125 |
| | gram | coef |
|---|---|---|
| 53076 | youtube | -3.438041 |
| 39652 | said | -2.730757 |
| 52725 | year | -2.449511 |
| 46099 | thanks | -2.224181 |
| 46670 | thread | -2.208948 |
| 30729 | music | -2.147386 |
| 27704 | lot | -2.083565 |
| 8422 | comes | -2.039134 |
| 19172 | hair | -2.016673 |
| 30983 | nationalist | -1.958121 |
Interpretation of coefficients using LIME (Local Interpretable Model-Agnostic Explanations)¶
LIME perturbs the input text and fits a local surrogate model to estimate the impact of each word on the prediction.
# Find some extreme examples
df_confused = df.loc[df["label"] != df["predicted"]]
pred_hate_not_hate = df_confused.loc[df_confused["label"]==0].sort_values(by="predicted_prob_hate").tail(1).values[0][0]
pred_not_hate_hate = df_confused.loc[df_confused["label"]==1].sort_values(by="predicted_prob_hate").head(1).values[0][0]
less_hate = df.sort_values(by="predicted_prob_hate").head(1).values[0][0]
most_hate = df.sort_values(by="predicted_prob_hate").tail(1).values[0][0]
pred_50_50 = "She says the class is out of control and the kids are unteachable , and the black administration does not support her "
print("Least hate: ", less_hate)
print("Most hate: ", most_hate)
print("Predicted hateful but labeled innocuous: ", pred_hate_not_hate)
print("Predicted innocuous but labeled hateful: ", pred_not_hate_hate)
print("Predicted 50/50: ", pred_50_50)
Least hate:  - YouTube
Most hate:  Look what happens when Whites leave black Countries alone to do what they do naturally The blacks in White Countries today should be on their knees thanking Whites for trying to civilize them
Predicted hateful but labeled innocuous:  Too many whites think they deserve what negroes dish out because of guilt .
Predicted innocuous but labeled hateful:  https://www.stormfront.org/forum/t1020784/ https : //www.stormfront.org/forum/t102 ... ghlight = sweden https : //www.stormfront.org/forum/t102 ... ghlight = sweden https : //www.stormfront.org/forum/t101 ... ghlight = sweden https : //www.stormfront.org/forum/t101 ... ghlight = sweden https : //www.stormfront.org/forum/t100 ... ghlight = sweden https : //www.stormfront.org/forum/t100 ... ghlight = sweden https : //www.stormfront.org/forum/t100 ... ghlight = sweden God save them .....
Predicted 50/50:  She says the class is out of control and the kids are unteachable , and the black administration does not support her
# start the explainer
explainer = LimeTextExplainer(class_names = ["Innocuous", "Hateful"], bow=False)
# shows the explanation for our example instances
for text in [less_hate, most_hate, pred_hate_not_hate, pred_not_hate_hate, pred_50_50]:
exp = explainer.explain_instance(text,
clf_best.predict_proba,
num_features = 10,
num_samples = 1000)
exp.show_in_notebook(text=text)
print(exp.as_list())
print("-"*100)
[('YouTube', -0.008337653307477069)]
----------------------------------------------------------------------------------------------------
[('black', 0.153676083840444), ('leave', 0.09873414409796714), ('Whites', 0.08420107092547535), ('Whites', 0.07686097330912474), ('blacks', 0.0658961961680142), ('today', -0.06380935040621671), ('Countries', 0.04662274208533168), ('knees', -0.04646490785484882), ('Countries', 0.04283281579844823), ('civilize', 0.019074622331620283)]
----------------------------------------------------------------------------------------------------
[('negroes', 0.4988126200655552), ('whites', 0.16825476534544995), ('guilt', 0.027831123400940575), ('think', -0.02273559427467263), ('out', 0.017615104078840035), ('they', 0.017330183719354523), ('Too', -0.013405004087654994), ('of', 0.012143130555456386), ('deserve', 0.008126777212364046), ('many', -0.00742166518130125)]
----------------------------------------------------------------------------------------------------
[('www', -0.0037638002374788177), ('www', -0.003225165954635936), ('sweden', 0.0011652656454728223), ('sweden', 0.00116208288851831), ('sweden', 0.0011218100169944469), ('sweden', 0.0009988607811029617), ('sweden', 0.0008649555102714504), ('sweden', 0.0008327781579675137), ('sweden', 0.0007638027083023633), ('God', 0.0002840305582186586)]
----------------------------------------------------------------------------------------------------
[('black', 0.4166909189615143), ('control', 0.17721661658157137), ('administration', -0.1316922913324194), ('class', -0.0921299350163489), ('kids', -0.060055613301412354), ('does', -0.050410279942509933), ('says', 0.02523400772680399), ('support', 0.01632736149301835), ('She', 0.0073769148863140205), ('her', 0.003947671066317231)]
----------------------------------------------------------------------------------------------------
sample_text = "I believe Dutch people have inferior food and they should be colonized by Belgium"
exp = explainer.explain_instance(sample_text,
                                 clf_best.predict_proba,
                                 num_features = 10,
                                 num_samples = 1000)
exp.show_in_notebook(text=sample_text)
print(exp.as_list())
print("-"*100)
[('people', -0.009334582632402822), ('food', 0.008685846579016096), ('inferior', 0.0071903598343557515), ('Belgium', -0.0035074247186144283), ('believe', -0.001246835012816647), ('colonized', -0.0001359449036872047), ('they', -0.00012765426499100103), ('be', 7.59926887779622e-05), ('Dutch', -6.822514059324273e-05), ('by', -5.116923577947203e-05)]
----------------------------------------------------------------------------------------------------
Note on model.summary()¶
Call model.summary() after model.fit(), or once the model has processed at least one batch of data, because:
- Keras builds the layers lazily: weights and output shapes are created the first time the model sees input data (unless an input shape is specified up front).
- Before that, the summary shows unknown shapes and zero parameters.
- After building, the summary displays the actual shapes and parameter counts.
This ensures an accurate and complete model summary.
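A minimal sketch of this behavior (layer sizes are arbitrary): the model is built by its first forward pass, after which summary() reports concrete shapes.

```python
# Sketch: summary() before the first forward pass would show an unbuilt
# model (or raise, depending on the Keras version); after one pass the
# shapes and parameter counts are known.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential([
    Embedding(input_dim=1000, output_dim=16),
    LSTM(8),
    Dense(1, activation="sigmoid"),
])
# at this point the model is not built: shapes unknown, zero parameters
model(np.zeros((1, 20), dtype="int32"))  # one forward pass builds the model
model.summary()  # now shows concrete output shapes and parameter counts
```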